9 research outputs found

Strategies for improving low resource speech to text translation relying on pre-trained ASR models

Author: Ciuba Alejandro
Kesiraju Santosh
Macaire Cecile
Pavlicek Tomas
Sarvas Marek
Publication date: 31/05/2023

This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST). We conducted experiments on both simulated and real low-resource setups, on the language pairs English - Portuguese and Tamasheq - French, respectively. Using the encoder-decoder framework for ST, our results show that a multilingual automatic speech recognition system acts as a good initialization under low-resource scenarios. Furthermore, using CTC as an additional objective for translation during training and decoding helps to reorder the internal representations and improves the final translation. Through our experiments, we try to identify the factors (initializations, objectives, and hyper-parameters) that contribute the most to improvements in low-resource setups. With only 300 hours of pre-training data, our model achieved a BLEU score of 7.3 on Tamasheq - French data, outperforming prior published work from IWSLT 2022 by 1.6 points.
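
A common way to use CTC alongside an attention-based encoder-decoder objective is a weighted sum of the two losses. The expression below is a sketch of that general idea, not necessarily the formulation used in the paper; the interpolation weight \lambda and the loss symbols are assumptions that do not appear in the abstract.

```latex
\mathcal{L}_{\text{total}} = \lambda \, \mathcal{L}_{\text{CTC}} + (1 - \lambda) \, \mathcal{L}_{\text{att}}, \qquad 0 \le \lambda \le 1
```

Under this assumed form, \lambda = 0 recovers plain attention-based training, and larger values give the CTC term more influence.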

arXiv.org e-Print Archive

An Empirical Evaluation of Zero Resource Acoustic Unit Discovery

Author: Burget Lukas
Dehak Najim
Ghahremani Pegah
Kesiraju Santosh
Khudanpur Sanjeev
Liu Chunxi
Ondel Lucas
Rott Alena
Sun Ming
Yang Jinyi
Publication date: 04/02/2017

Acoustic unit discovery (AUD) is the process of automatically identifying a categorical acoustic unit inventory from speech and producing corresponding acoustic unit tokenizations. AUD provides an important avenue for unsupervised acoustic model training in a zero-resource setting where expert-provided linguistic knowledge and transcribed speech are unavailable. To further facilitate the zero-resource AUD process, in this paper we demonstrate that acoustic feature representations can be significantly improved by (i) performing linear discriminant analysis (LDA) in an unsupervised self-trained fashion, and (ii) leveraging resources of other languages by building a multilingual bottleneck (BN) feature extractor to give effective cross-lingual generalization. Moreover, we perform comprehensive evaluations of AUD efficacy on multiple downstream speech applications, and their correlated performance suggests that AUD evaluations are feasible using alternative language resources when only a subset of these evaluation resources is available, as is typical in zero-resource applications. Comment: 5 pages, 1 figure; Accepted for publication at ICASSP 201

arXiv.org e-Print Archive

Crossref

Analysis of BUT-PT Submission for NIST LRE 2017

Author: Burget Lukáš
Cumani Sandro
Diez Mireia
Glembek Ondřej
Grézl František
Kamsali Mounika
Kesiraju Santosh
Lozano-Diez Alicia
Matějka Pavel
Novotný Ondřej
Ondel Lucas
Plchot Oldřich
Rohdin Johan
Silnova Anna
Slavíček Josef
Publication venue: 'International Speech Communication Association'
Publication date: 01/01/2018

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

BAT System Description for NIST LRE 2015

Author: Brummer Niko
Burget Lukas
Cumani Sandro
Fer Radek
Glembek Ondrej
Grezl Frantisek
Karafiat Martin
Kesiraju Santosh
Li Ruizhi
Mallidi Sri Harish
Matejka Pavel
Novotny Ondrej
Ondel Lucas
Pesan Jan
Plchot Oldrich
Swart Albert
Vesely Karel
Publication venue: ISCA

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Zero-Time Windowing Cepstral Coefficients for Dialect Classification

Author: Gangashetty Suryakanth
Kadiri Sudarsana
Kesiraju Santosh
Kethireddy Rashmi
Publication venue: 'International Speech Communication Association'
Publication date: 01/01/2020

In this paper, we propose to use novel acoustic features, namely zero-time windowing cepstral coefficients (ZTWCC), for dialect classification. ZTWCC features are derived from the high-resolution spectrum obtained with the zero-time windowing (ZTW) method, and were shown to discriminate speech sound characteristics more effectively than a DFT spectrum. Our proposed system is based on i-vectors trained on static and shifted delta coefficients of ZTWCC. The i-vectors are further whitened before classification. The proposed system is compared with an i-vector baseline system trained on Mel frequency cepstral coefficient (MFCC) features. Classification results on the STYRIALECT database (German) and the UT-Podcast database (English) revealed that the system with the proposed features outperformed the aforementioned baseline system. Our detailed experimental analysis on dialect classification shows that the i-vector system can indeed exploit the high spectral resolution of ZTWCC and hence performs better than the MFCC-based system. Peer reviewed
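
The whitening step mentioned in the abstract can be illustrated with a minimal sketch. It assumes whitening means centering the vectors and rescaling them to identity covariance through an eigendecomposition; the abstract does not say which variant the authors used. The function name `whiten`, the `eps` guard, and the random stand-in data are all hypothetical.

```python
import numpy as np

def whiten(vectors, eps=1e-10):
    """Center the rows of `vectors` and map them to (near) identity covariance."""
    centered = vectors - vectors.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Divide each eigenvector (column) by the square root of its variance;
    # eps guards against division by a vanishing eigenvalue.
    transform = eigvecs / np.sqrt(eigvals + eps)
    return centered @ transform

# Hypothetical stand-in for i-vectors: correlated 4-dimensional random data.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(4, 4))
ivectors = rng.normal(size=(500, 4)) @ mixing
whitened = whiten(ivectors)
```

After this transform the sample covariance of `whitened` is approximately the identity matrix, which is the property a downstream classifier would rely on.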

Aaltodoc Publication Archive

BUT-PT System Description for NIST LRE 2017

Author: Burget Lukáš
Cumani Sandro
Diez Mireia
Glembek Ondřej
Grezl Frantisek
Kamsali Mounika
Kesiraju Santosh
Lozano-Diez Alicia
Matejka Pavel
Novotny Ondrej
Ondel Lucas
Plchot Oldřich
Rohdin Johan
Silnova Anna
Slavicek Josef
Publication venue: HAL CCSD
Publication date: 01/01/2017

Hal-Diderot

Automatic Processing Pipeline for Collecting and Annotating Air-Traffic Voice Communication Data

Author: Alexander Blatt
Allan Tart
Amrutha Prasad
Chloe Salamin
Claudia Cevenini
Dietrich Klakow
Fabian Landis
Hicham Atassi
Igor Szöke
Iuliia Nigmatulina
Jan Černocký
Juan Zuluaga-Gomez
Karel Veselý
Khalid Choukri
Martin Kocour
Mickael Rigault
Pavel Kolčárek
Petr Motlíček
Saeed Sarfjoo
Santosh Kesiraju
Publication venue: 'MDPI AG'
Publication date: 31/12/2021

This document describes the pipeline for automatic processing of ATCO-pilot audio communication that we developed as part of the ATCO2 project. So far, we have collected two thousand hours of audio recordings that we either preprocessed for the transcribers or used for semi-supervised training. Both uses of the collected data can further improve our pipeline through retraining of our models. The proposed automatic processing pipeline is a cascade of many standalone components: (a) segmentation, (b) volume control, (c) signal-to-noise ratio filtering, (d) diarization, (e) ‘speech-to-text’ (ASR) module, (f) English language detection, (g) call-sign code recognition, (h) ATCO-pilot classification and (i) highlighting commands and values. The key component of the pipeline is a speech-to-text transcription system that has to be trained with real-world ATC data; otherwise, the performance is poor. To further improve speech-to-text performance, we apply both semi-supervised training with our recordings and contextual adaptation that uses a list of plausible callsigns from surveillance data as auxiliary information. Downstream NLP/NLU tasks are important from an application point of view. These application tasks need accurate models operating on top of the real speech-to-text output; thus, there is a need for more data too. Creating ATC data is the main aspiration of the ATCO2 project. At the end of the project, the data will be packaged and distributed by ELDA.

Multidisciplinary Digital Publishing Institute